Using multiple sampling strategies to estimate SARS-CoV-2 epidemiological parameters from genomic sequencing data

The choice of viral sequences used in genetic and epidemiological analysis is important as it can induce biases that detract from the value of these rich datasets. This raises questions about how a set of sequences should be chosen for analysis. We provide insights on these largely understudied problems using SARS-CoV-2 genomic sequences from Hong Kong, China, and the Amazonas State, Brazil. We consider multiple sampling schemes which were used to estimate Rt and rt as well as related R0 and date of origin parameters. We find that both Rt and rt are sensitive to changes in sampling whilst R0 and the date of origin are relatively robust. Moreover, we find that analysis using unsampled datasets results in the most biased Rt and rt estimates for both our Hong Kong and Amazonas case studies. We highlight that sampling strategy choices may be an influential yet neglected component of sequencing analysis pipelines.

Rhys et al. address an interesting and important question relating to the best sampling strategies for genomic sequences of SARS-CoV-2. Using sequences to estimate epidemiological parameters from genomic data requires determining how sequence sampling can influence these parameters. This is particularly important for resource-poor settings where sequencing capabilities are limited. They find that sampling strategies can induce biases, particularly in the estimates of Rt and rt, which are sensitive to changes in sampling, whilst R0 and the date of origin of a lineage are relatively robust to different sampling strategies.
The authors use the Jensen-Shannon Distance (JSD) to compare the parameters obtained from the different genomic sampling strategies to those obtained using epidemiological data; however, it is difficult to see how the different genomic sampling strategies compare to each other. For example, Figure 5 shows that the JSD of the proportional sampling and uniform sampling to Epifilter are the same, but this does not necessarily mean that the proportional sampling and uniform sampling are the same. Do sampling strategies show the same wrong pattern? I would suggest presenting the results as a matrix of pairwise JSD values so this information can be conveyed.
The manuscript also contains several misspellings, confusing figure captions and missing methodological information:
Line 270: what is defined as high quality complete genomes?
Line 284: which version of Pangolin software was used?
Line 285: selected for. => selected.
Line 323: as was used => and was used
Line 492: This is likely due to the Hong Kong datasets have a wider sampling interval => having
Line 505: overlapped => overlapping
Seeing the confirmed cases from Figure 2 alongside the Rt estimated in Figure 4 would make comparison easier.
In Figure 4 it seems quite clear that Proportional is optimal. Are the other three sampling strategies more similar to each other than they are to the proportional?
In Figures 5-7, is the distance between each sampling strategy very distinct to the unsampled?
Line 640: than initial => than the initial
Line 675: remove "through"
Supplementary figure 1: Has this been limited to a specific geographical region? Amazonas only? Add the details to the caption.
Reply: We have amended the text and figure caption to improve clarity. The text now reads 'We found from using genomic data, Hong Kong had a posterior mean R0 estimate of 2.07 (Figure 3A) across all sampling strategies. Using a proportional sampling strategy gave the highest posterior mean R0 estimate of 2.38 with the unsampled sampling strategy giving the lowest posterior mean R0 estimate of 1.87. Overall, Brazil had a higher posterior mean R0 estimate with a value of 2.24 (Figure 3B) across all sampling strategies. The uniform sampling strategy yielded the highest posterior mean R0 estimate of 2.50 while the unsampled sampling strategy gave the lowest one of 1.82. Using case data, we similarly found that Hong Kong had a lower R0 of 2.17 (95% credible interval (CI) = 1.43-2.83) when compared to Amazonas which had an R0 of 3.67 (95% CI = 2.83-4.48). All sampling schemes for both datasets were characterised by similar R0 values (Figure 3) indicating that the estimation of R0 is robust to changes in sampling scheme.' Moreover, in the figure caption of each violin plot we have stated that 'The central line represents the posterior mean estimate and intervals demarcate the 95% Highest Posterior Density Interval.' to improve clarity.
Reply: We thank the reviewer for their comment. The ordering of the sampling schemes has been changed to reflect our figures. With respect to the figure E panels, we would prefer to keep the existing order, which ranks the schemes from lowest to highest JSD, as we feel this improves interpretation.

Reviewer #2 (Remarks to the Author):
Rhys et al. address an interesting and important question relating to the best sampling strategies for genomic sequences of SARS-CoV-2. Using sequences to estimate epidemiological parameters from genomic data requires determining how sequence sampling can influence these parameters. This is particularly important for resource-poor settings where sequencing capabilities are limited. They find that sampling strategies can induce biases, particularly in the estimates of Rt and rt, which are sensitive to changes in sampling, whilst R0 and the date of origin of a lineage are relatively robust to different sampling strategies.
We thank the Reviewer for the positive assessment of our work.
Query 1. The authors use the Jensen-Shannon Distance (JSD) to compare the parameters obtained from the different genomic sampling strategies to those obtained using epidemiological data; however, it is difficult to see how the different genomic sampling strategies compare to each other. For example, Figure 5 shows that the JSD of the proportional sampling and uniform sampling to Epifilter are the same, but this does not necessarily mean that the proportional sampling and uniform sampling are the same. Do sampling strategies show the same wrong pattern? I would suggest presenting the results as a matrix of pairwise JSD values so this information can be conveyed.
Reply: Thanks for this excellent suggestion. We hope that these additional analyses have improved the understanding and clarity of our study. To determine whether the sampling strategies were showing the same wrong pattern, we computed a matrix of pairwise JSD, as suggested. These are now included in Figures 4-7 (panel F) for each pair of sampling strategies: e.g., unsampled vs proportional, proportional vs uniform, etc. We found that the unsampled sampling scheme was consistently distinct from all other sampling schemes, whilst the uniform and inverse sampling schemes were consistently the most similar. Also see our response to Query 12 below.
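For readers unfamiliar with the calculation, a pairwise JSD matrix of the kind described in this reply could be computed roughly as sketched below. This is a minimal illustration, not the authors' code. It assumes each sampling strategy's estimate has been summarised as a histogram over a common set of bins. The function names `js_distance` and `pairwise_jsd` and the example counts are hypothetical and are not taken from the study.

```python
import math
from itertools import combinations


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions.

    p and q are non-negative weights over the same bins; they are
    normalised to sum to 1 before the comparison.
    """
    p_total, q_total = sum(p), sum(q)
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]
    # Mixture distribution used by the Jensen-Shannon divergence.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; bins with zero mass contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # The distance is the square root of the divergence.
    return math.sqrt(max(divergence, 0.0))


def pairwise_jsd(distributions):
    """Return a symmetric nested dict of JS distances between named distributions."""
    names = list(distributions)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = js_distance(distributions[a], distributions[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix


# Hypothetical binned estimates, for illustration only (not the study's data).
example = {
    "unsampled": [1, 2, 6, 1],
    "proportional": [4, 3, 2, 1],
    "uniform": [3, 4, 2, 1],
    "inverse": [3, 3, 3, 1],
}
matrix = pairwise_jsd(example)
for name in example:
    print(name, [round(matrix[name][other], 3) for other in example])
```

With base-2 logarithms the distance lies between 0 (identical distributions) and 1 (no overlapping mass). Two strategies can therefore sit at the same distance from a reference and still be far from each other.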
Query 2. Line 270: what is defined as high quality complete genomes?